Single-cell sequencing uses tiny nanoliter droplets that are created using microfluidics. Ideally, each droplet contains one individual cell and one bead that has a molecular barcode, used to uniquely label transcripts from that cell.
If the volume of each droplet is 0.5 nanoliter, and cells are present at a constant concentration of 100 cells per microliter, what is the rate of cells included per droplet? (Hint: Remember to convert units!!!)
# Calculate the rate of the event (a cell being present in a droplet)
# Convert ul to nl: conc = (100/ul)/(1000/nl) = 0.1 cell / nl
cell.droplet.rate <- 0.1 * 0.5
print(paste("Rate is", cell.droplet.rate, "cells per droplet"))
## [1] "Rate is 0.05 cells per droplet"
What is the expected percentage of droplets that don’t have even a single cell?
print(paste0((1 - cell.droplet.rate)*100, "% of droplets do not contain a cell."))
## [1] "95% of droplets do not contain a cell."
Typically, an experiment might start out with around one million cells. Is the Poisson distribution a good model for these data?
# your answer here
The Poisson distribution measures the number of events that occur in a fixed interval when events occur at a constant rate. The event here is a cell being included in a droplet, concentration is the known rate of events occurring, and the droplet size is a fixed interval.
Since p is low and n is high, and there may be more than one event per interval, the Poisson is a good model for this experimental setup.
Plot the PMF for the Poisson distribution using this rate (the PMF is a PDF for a discrete distribution):
pois family of functions to generate an ideal Poisson PMF with the appropriate lambda parameter across a range of 0-10 cells.Cell_count and Density.ggplot to plot a histogram of the probability distribution of cell counts per droplet.
stat="identity" in the geom layerscale_x_continuous(breaks=pretty_breaks()) (pretty breaks is from the scales package)# ideal probability mass function
range = c(0:10)
cell.droplet.pmf <- dpois(range, cell.droplet.rate)
# Check the sum of the PMF
cell.droplet.pmf.sum <- sum(cell.droplet.pmf)
cell.droplet.pmf.sum
## [1] 1
# Create the data frame
cell.droplet.pmf.df <- data.frame(Cell_count = range,
Density = cell.droplet.pmf)
cell.droplet.pmf.df
## Cell_count Density
## 1 0 9.512294e-01
## 2 1 4.756147e-02
## 3 2 1.189037e-03
## 4 3 1.981728e-05
## 5 4 2.477160e-07
## 6 5 2.477160e-09
## 7 6 2.064300e-11
## 8 7 1.474500e-13
## 9 8 9.215625e-16
## 10 9 5.119792e-18
## 11 10 2.559896e-20
# Plot the distribution
ggplot(cell.droplet.pmf.df, aes(x=Cell_count, y=Density)) +
geom_histogram(stat="identity", fill="deepskyblue", color="black", alpha=0.25) +
theme_classic() +
# Use the `pretty_breaks()` function from the scales package to make the X-axis cleaner
scale_x_continuous(breaks=pretty_breaks()) +
xlab("Cells per droplet")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
The experimentalist can control the concentration of cells in this experiment. Plot the distribution of cell counts for experiments with 100, 1000, or 2000 cells/microliter.
Generate PDF data for each concentration of cells/ul
Make a data frame containing the distributions
Convert the data frame to long format before plotting
melt from the reshape package:
Use ggplot2 to make a histogram showing all three distributions:
stat="identity" in the histogram layerposition="dodge2" to position the bars next to each other instead of superimposed# rate per droplet
mu = cell.droplet.rate
mu
## [1] 0.05
# generate the PDFs
cell.droplet.pmf.n100 = dpois(range, mu) # 100
cell.droplet.pmf.n1000 = dpois(range, mu * 10) # 1k
cell.droplet.pmf.n2000 = dpois(range, mu * 20) # 2k
# Create a data frame with one column for the range vector and one for each dataset
q1c.data.frame <- data.frame(Cell_count = range,
n100 = cell.droplet.pmf.n100,
n1000 = cell.droplet.pmf.n1000,
n2000 = cell.droplet.pmf.n2000)
# look at the first few lines of the data.frame
head(q1c.data.frame)
## Cell_count n100 n1000 n2000
## 1 0 9.512294e-01 0.6065306597 0.367879441
## 2 1 4.756147e-02 0.3032653299 0.367879441
## 3 2 1.189037e-03 0.0758163325 0.183939721
## 4 3 1.981728e-05 0.0126360554 0.061313240
## 5 4 2.477160e-07 0.0015795069 0.015328310
## 6 5 2.477160e-09 0.0001579507 0.003065662
# Melt the dataframe for ggplot to convert the data to "long" format
# This allows us to group the data by Concentration for plotting
# Specify the column names:
q1c.data.frame.melted <- melt(q1c.data.frame,
id.vars = "Cell_count", # x-axis
value.name = "Density", # y-axis
variable.name = "Concentration") # categories (plot key)
# look at the first few lines of the melted data frame
head(q1c.data.frame.melted)
## Cell_count Concentration Density
## 1 0 n100 9.512294e-01
## 2 1 n100 4.756147e-02
## 3 2 n100 1.189037e-03
## 4 3 n100 1.981728e-05
## 5 4 n100 2.477160e-07
## 6 5 n100 2.477160e-09
# Make the plot
ggplot(q1c.data.frame.melted, aes(x=Cell_count, y=Density, fill=Concentration)) +
geom_histogram(stat="identity", position="dodge2", alpha=0.5) +
scale_fill_manual(values = c("goldenrod","red","blue")) +
theme_classic() +
scale_x_continuous(breaks=pretty_breaks()) +
xlab("Cells per droplet")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Assuming experiment produces one million (1e6) droplets, convert the density plot above to a histogram showing the number of droplets per experiment that contain 0-10 cells per droplet for each concentration. Add a label to the y-axis indicating that it now represents the number of droplets instead of the density.
# Make the plot
ggplot(q1c.data.frame.melted, aes(x=Cell_count, y=Density*1e6, fill=Concentration)) +
geom_histogram(stat="identity", position="dodge2", alpha=0.5) +
scale_fill_manual(values = c("goldenrod","red","blue")) +
theme_classic() +
ylab("Number of droplets") +
scale_x_continuous(breaks=pretty_breaks())
## Warning: Ignoring unknown parameters: binwidth, bins, pad
To make the next few steps easier, now go ahead and add a column to your melted data frame containing the number of droplets out of 1e6 corresponding to the density for each datapoint. Take the sum of the droplet counts for each concentration to make sure this column adds up correctly.
q1c.data.frame.melted = q1c.data.frame.melted %>% mutate(Expected_droplets = Density*1e6)
head(q1c.data.frame.melted)
## Cell_count Concentration Density Expected_droplets
## 1 0 n100 9.512294e-01 9.512294e+05
## 2 1 n100 4.756147e-02 4.756147e+04
## 3 2 n100 1.189037e-03 1.189037e+03
## 4 3 n100 1.981728e-05 1.981728e+01
## 5 4 n100 2.477160e-07 2.477160e-01
## 6 5 n100 2.477160e-09 2.477160e-03
q1c.data.frame.melted %>% filter(Concentration == "n100") %>% select(Expected_droplets) %>% sum
## [1] 1e+06
q1c.data.frame.melted %>% filter(Concentration == "n1000") %>% select(Expected_droplets) %>% sum
## [1] 1e+06
q1c.data.frame.melted %>% filter(Concentration == "n2000") %>% select(Expected_droplets) %>% sum
## [1] 1e+06
Select just the rows with 1 cell, and then plot the number of droplets that are expected to contain only one cell for each concentration. Instead of using geom_histogram(), try using geom_col().
# filter the data
q1f.melted.1cell <- filter(q1c.data.frame.melted, Cell_count == 1)
q1f.melted.1cell
## Cell_count Concentration Density Expected_droplets
## 1 1 n100 0.04756147 47561.47
## 2 1 n1000 0.30326533 303265.33
## 3 1 n2000 0.36787944 367879.44
# Plot the data as a bar plot
ggplot(q1f.melted.1cell,
aes(x=Concentration, y=Expected_droplets, fill=Expected_droplets)) +
geom_col() +
theme_classic()
Which concentration produces the largest number of droplets with one cell?
# your answer here
200 cells / microliter gives the largest number of droplets with exactly one cell.
The optimal cell concentration maximizes the number of droplets with a single cell while at theh same time minimizing the rate of ‘doublets’, which have 2 or more cells in a single droplet.
Plot the number of droplets that have 1 cell and the number that have 2+ cells for each experimental concentration. To do this, we suggest the following steps (but you can do this any way you want):
aggregate function to do this, using the formula . ~ Concentration.Whatever way you choose to do this, make sure to comment your code!
# filter the data
q1g.melted.2_plus_cells = filter(q1c.data.frame.melted, Cell_count > 1)
q1g.melted.2_plus_cells
## Cell_count Concentration Density Expected_droplets
## 1 2 n100 1.189037e-03 1.189037e+03
## 2 3 n100 1.981728e-05 1.981728e+01
## 3 4 n100 2.477160e-07 2.477160e-01
## 4 5 n100 2.477160e-09 2.477160e-03
## 5 6 n100 2.064300e-11 2.064300e-05
## 6 7 n100 1.474500e-13 1.474500e-07
## 7 8 n100 9.215625e-16 9.215625e-10
## 8 9 n100 5.119792e-18 5.119792e-12
## 9 10 n100 2.559896e-20 2.559896e-14
## 10 2 n1000 7.581633e-02 7.581633e+04
## 11 3 n1000 1.263606e-02 1.263606e+04
## 12 4 n1000 1.579507e-03 1.579507e+03
## 13 5 n1000 1.579507e-04 1.579507e+02
## 14 6 n1000 1.316256e-05 1.316256e+01
## 15 7 n1000 9.401827e-07 9.401827e-01
## 16 8 n1000 5.876142e-08 5.876142e-02
## 17 9 n1000 3.264523e-09 3.264523e-03
## 18 10 n1000 1.632262e-10 1.632262e-04
## 19 2 n2000 1.839397e-01 1.839397e+05
## 20 3 n2000 6.131324e-02 6.131324e+04
## 21 4 n2000 1.532831e-02 1.532831e+04
## 22 5 n2000 3.065662e-03 3.065662e+03
## 23 6 n2000 5.109437e-04 5.109437e+02
## 24 7 n2000 7.299195e-05 7.299195e+01
## 25 8 n2000 9.123994e-06 9.123994e+00
## 26 9 n2000 1.013777e-06 1.013777e+00
## 27 10 n2000 1.013777e-07 1.013777e-01
# sum up the totals by group using the aggregate function
q1g.melted.2_plus_cells = aggregate(. ~ Concentration, q1g.melted.2_plus_cells, sum)
q1g.melted.2_plus_cells
## Concentration Cell_count Density Expected_droplets
## 1 n100 54 0.001209104 1209.104
## 2 n1000 54 0.090204010 90204.010
## 3 n2000 54 0.264241108 264241.108
# relabel the cell counts as "2+"
q1g.melted.2_plus_cells$Cell_count = "2+"
q1g.melted.2_plus_cells
## Concentration Cell_count Density Expected_droplets
## 1 n100 2+ 0.001209104 1209.104
## 2 n1000 2+ 0.090204010 90204.010
## 3 n2000 2+ 0.264241108 264241.108
# combine the 1 and 2+ data
q1g.plot.data = rbind(q1f.melted.1cell, q1g.melted.2_plus_cells)
q1g.plot.data
## Cell_count Concentration Density Expected_droplets
## 1 1 n100 0.047561471 47561.471
## 2 1 n1000 0.303265330 303265.330
## 3 1 n2000 0.367879441 367879.441
## 4 2+ n100 0.001209104 1209.104
## 5 2+ n1000 0.090204010 90204.010
## 6 2+ n2000 0.264241108 264241.108
# Plot the combined data as a bar plot
ggplot(q1g.plot.data, aes(x=Concentration, y=Expected_droplets, fill=Cell_count)) +
geom_col(position="dodge2") + # plot side-by-side
theme_classic()
Which experimental concentration has the lowest doublet rate? Which concentration would you choose for your experiment?
# your answer here
The 100 cells / microliter concentration has the lowest doublet rate. Even though you get a lot more droplets with one cell/ul at 1000 or 2000 cells, the rate of doublets is also higher.
Since the goal is to minimize doublets, we will take a hit in total droplets with 1 cell. This tradeoff is worth it because otherwise we will not be able to deconvolute the data.
Above, we simulated idealized data for this experiment for the purposes of illustration. If we could optically measure the number of cells per droplet in an actual experiment, then we could see whether the data actually fit the expected distribution.
Let’s say that you’ve used a cell counter at the start of the experiment to estimate the concentration of cells. You targeted to get a concentration of 100 cells per microliter, but the actual concentration was 110 cells per microliter.
Now you want to see how well the actual distribution of cells per droplet will match the expected distribution.
xtabs() to make a table of the expected number of cells per droplet for the ideal n100 distribution.
Expected_droplets ~ Cell_count here.# sample 1M droplets from the Poisson distribution
sample.rpois = rpois(1e6, lambda=0.11 * 0.5)
head(sample.rpois)
## [1] 0 0 0 0 0 1
# make a frequency table from the sampled data
sample.rpois.table = table(sample.rpois)
sample.rpois.table
## sample.rpois
## 0 1 2 3
## 946807 51670 1496 27
# filter the n100 data from the ideal distributions
n100.data = q1c.data.frame.melted %>% filter(Concentration == "n100") %>% select(c(Cell_count, Expected_droplets))
# use xtabs to make a table of the ideal n100 data
ideal.rpois.table = xtabs(Expected_droplets ~ Cell_count, data = n100.data)
ideal.rpois.table
## Cell_count
## 0 1 2 3 4 5
## 9.512294e+05 4.756147e+04 1.189037e+03 1.981728e+01 2.477160e-01 2.477160e-03
## 6 7 8 9 10
## 2.064300e-05 1.474500e-07 9.215625e-10 5.119792e-12 2.559896e-14
Since the sampled data will have only up to 3 or 4 cells per droplet, you will notice that the two tables are not the same size! To generate the table we need for the Chi-square test, you will need to truncate the “ideal” table to match the length of the sampled data.
Go ahead and this, and then join them together into a single contingency table. You should end up with a table containing two rows and around 3-4 columns.
# truncate the table with the ideal data to the correct length
ideal.rpois.table = ideal.rpois.table[1:length(sample.rpois.table)]
ideal.rpois.table
## Cell_count
## 0 1 2 3
## 951229.42450 47561.47123 1189.03678 19.81728
# bind the data together into a single table
chisq.data = rbind(sample.rpois.table,
ideal.rpois.table)
chisq.data
## 0 1 2 3
## sample.rpois.table 946807.0 51670.00 1496.000 27.00000
## ideal.rpois.table 951229.4 47561.47 1189.037 19.81728
Plot the distribution of cell counts per droplet for the observed and expected data. How different do they look?
# make a bar plot to compare these visually
barplot(chisq.data, beside=T,
col=c("aquamarine3","coral"))
legend("topright", c("Observed","Expected Poisson"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
Finally, perform a Chi-squared test of the observed vs. expected cell counts.
# do the chi-squared test
chisq.test( chisq.data )
##
## Pearson's Chi-squared test
##
## data: chisq.data
## X-squared = 216.61, df = 3, p-value < 2.2e-16
chisq.test( chisq.data )$p.value
## [1] 1.086926e-46
How do your sampled data compare to the ideal data in the Chi-squared test?
# your answer here
Even though the distributions look pretty close by eye, for most samples they will be highly statistically different. In fact sometimes the sampled data will contain even fewer samples with o or 2+ cells (even though the lambda is a little bit higher), just due to differences between random samples.
However, it looks like there should still be very few droplets with more than one cell, so you can probably go ahead with the experiment anyway (or you could just dilute your sample a little bit instead.)
This is a case where even though the statistics may be significant, the overall magnitude of the differences is relatively small, so in practice they won't matter that much.
Over the last few years continuous improvements in single-cell technology have been made to increase the proportion of droplets containing a single cell as well as throughput. Alternative methods such as split-pooling have also been developed that can be scaled to accommodate very large samples, and a variety of methods have been developed to allow multiplexing of multiple samples.
In a previous homework, we looked at the mouse high-fat diet dataset and computed confidence intervals for multiple samples of 12 mice from the control population. Since we didn’t know about the \(t\)-distribution, we just used a \(z\)-distribution to calculate confidence intervals.
You may or may not have noticed that when you re-ran the code for finding the 95% CI for samples from the control population, that usually slightly more than 5 samples out of 100 did not contain the true population mean. This is because 95%CIs for the \(z\)-distribution were too narrow and did not take into account the variability of small samples!
The dataset attached to this homework contains the cleaned dataset we made in that homework. First, load the dataset and create a vector containing just the weights for the male control population.
# load the dataset
mice.pheno = read.csv("mice.pheno.cleaned.csv")
# subset the bodyweights of the **male** mice fed the **chow** diet
chow = mice.pheno[which(mice.pheno$Sex == "M" & mice.pheno$Diet == "chow"),"Bodyweight"]
Below is reproduced the code from that homework for visualizing 100 95% confidence intervals, using just the chow data:
# ============================================================================ #
n.samples = 100 # number of random samples
N = 12 # sample size
Q = qnorm(0.975) # z-score for +/- 47.5% of a normal distribution
# set up a plot showing a vertical line with the true mean for each population
# rename chow and hf to reflect that we are using these as the population
chowPop = chow
hfPop = hf
## Error in eval(expr, envir, enclos): object 'hf' not found
plot(mean(chowPop)+c(-7,7),c(1,1),type="n",
xlab="weight",ylab="interval",ylim=c(1,n.samples))
abline(v=mean(chowPop))
# display the 95% CI for a bunch of random samples
#ci_range = c() # initialize empty vector
for (i in 1:n.samples) {
chow.sample = sample(chowPop,N) # take a random sample
se = sd(chow.sample)/sqrt(N) # compute SEM
interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
# ci_range[i] = interval[2] - interval[1] # record size of ci interval
# draw the 95% CI -- color it green if it contains the true population mean, or red if not
covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
color = ifelse(covered,"forestgreen","red")
lines(interval, c(i,i), col=color)
}
#summary(ci_range) # display summary of the distribution of CI ranges
# ============================================================================ #
Re-run the above code block a bunch of times. Do you notice that usually more than 5 CI’s don’t contain the true mean?
Now turn the for loop above into a function called ci.fail.counter. Instead of having the code add a line to a graph, modify it so that all it does is return the number of CI’s out of 100 that DO NOT contain the true population mean.
The function should take the following parameters:
chowPop: Data for an entire population (vector)N: the sample size (default = 12)n.samples: the number of samples to take (default = 100)Note:
Q inside the function so it doesn’t depend on any other outside information.ci.fail.counter = function(chowPop, N=12, n.samples=100) {
counter = 0 # create a counter
Q = qnorm(0.975) # z-score for +/- 47.5% of a normal distribution
for (i in 1:n.samples) {
chow.sample = sample(chowPop,N) # take a random sample
se = sd(chow.sample)/sqrt(N) # compute SEM
interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
# flag the 95% CI as either covering or not covering the true population mean
covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
# add to counter if CI doesn't cover the population mean
if (!covered) {counter = counter+1}
}
# return the total number of CI's that don't contain the population mean
return(counter)
}
Below, test your function by calling it once.
When everything looks ok, run the function 100 times and record the number of CI’s that don’t cover the true population mean each time.
Check the results by making a table of the counts.
# call the function once
ci.fail.counter(chowPop = chowPop)
## [1] 7
# initialize an empty vector
ci.fail.count = c()
# run the function 100 times and fill up the vector with the results from each run
## option 1: use replicate function
ci.fail.count = replicate(100, ci.fail.counter(chowPop), simplify = TRUE)
## option 2: use a for loop
for (i in 1:100) {
ci.fail.count[i] = ci.fail.counter(chowPop)
}
# take a peek at the first few rows of the vector
head(ci.fail.count)
## [1] 8 11 4 7 5 7
# make a table to check the results
table(ci.fail.count)
## ci.fail.count
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 2 1 5 9 13 19 12 10 12 7 6 2 1 1
Make a quick plot of the number of 95%CIs per 100 CI’s that don’t contain the true population mean. How does the distribution look? What are the mean and median failed CI’s using the \(z\)-distribution?
hist(ci.fail.count, col="peachpuff",
xlab="Number of failed 95%CIs",
main="95%CI failure rate per 100 using z-score")
mean(ci.fail.count)
## [1] 6.93
median(ci.fail.count)
## [1] 7
Copy your function above and modify it so that instead of using the \(z\)-distribution, it uses the 95% quantiles for a \(t\)-distribution with the right number of d.f.
ci.fail.counter = function(chowPop, N=12, n.samples=100) {
counter = 0 # create a counter
Q = qt(0.975, df = N-1) # z-score for +/- 47.5% of a normal distribution
for (i in 1:n.samples) {
chow.sample = sample(chowPop,N) # take a random sample
se = sd(chow.sample)/sqrt(N) # compute SEM
interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
# flag the 95% CI as either covering or not covering the true population mean
covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
# add to counter if CI doesn't cover the population mean
if (!covered) {counter = counter+1}
}
# return the total number of CI's that don't contain the population mean
return(counter)
}
Repeat part (b) using the new version of your function.
ci.fail.count = c()
for (i in 1:100) {
ci.fail.count[i] = ci.fail.counter(chowPop)
}
head(ci.fail.count)
## [1] 7 8 7 4 3 4
table(ci.fail.count)
## ci.fail.count
## 1 2 3 4 5 6 7 8 9 11
## 2 8 14 21 20 16 8 5 5 1
Repeat part (c) using the new data based on the \(t\)-distribution.
hist(ci.fail.count, col="skyblue",
xlab="Number of failed 95%CIs",
main="95%CI failure rate per 100 using t-score")
mean(ci.fail.count)
## [1] 4.92
median(ci.fail.count)
## [1] 5
How do these results look in comparison to your earlier results?
# your answer here
Using the z-score typically gives a mean/median failure rate of around 7%.
Using the t-score gives a failure rate much closer to 5% - in fact, re-running this a bunch of times oten finds the mean and median to be closer to 4! This suggest that the t-distribution may actually slightly overestimate the width of the CI much of the time.